CIDO RESEARCH
1 Introduction
This report is one of 493,845 that I will make, and one of 104,070,413 that could be made.
I “took” the 1.4 TB of LinkedIn data that was breached in 2020 and turned it into insights to power my job HUNT.
The insights I can share in this report that also relate to my goals are:
- Industry-based recruitment trend.
- Company-based workforce timeline.
- Current/past workforce info:
  - Basic info: name, job title, status, and social links. I could add geolocation for the people whose records include it, but that would look creepy.
  - Their work period.
  - Their experience.
2 About me
Salutations! I’m Joseph, a self-taught data analyst, engineer, and scraper.
Despite life’s challenges, my goal remains a remote job, either full-time or part-time, and friends with whom to tackle the challenges of this changing world.
To show my skills and dedication, I built this project, which produced this tailored report.
3 About the project
3.1 How this project came to life
As you know by now from my email, I am hunting for a job.
About a year ago, I scraped contact info from Google Maps to get my first job. Later I scraped contacts from the LinkedIn website… you can check how that went here.
Recently, I finally got around to learning SQL because of DuckDB, software that lets you process big data on your local machine by using storage space as overflow for RAM. Then I remembered the leaked LinkedIn data that I had not been able to process.
Thus began my journey to learn SQL, process the data, and make something out of it.
3.2 The process
The whole process ran on my local machine and went as follows.
3.2.1 Downloaded the leaked data
I downloaded the data from a torrent.
There were around 700 .gz files, each around 280 MB; 196 GB in total.
Each .gz file contains a 2 GB file; 1.4 TB in total.
Each file has many lines, and each line is a JSON object. The file as a whole is not a single JSON document; it just holds multiple JSON objects, one per line.
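As a rough illustration of this layout, the sketch below reads a gzip-compressed file line by line and parses each line as its own JSON object. It uses only the Python standard library; the function name `iter_records`, the file name, and the two sample records are made up for the example and are not taken from the real data.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_records(gz_path):
    """Yield one parsed JSON object per non-empty line of a gzip file."""
    # "rt" decompresses on the fly and decodes to text, so the large
    # decompressed payload is never held in memory in one piece.
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny stand-in archive with made-up records (hypothetical fields).
sample_path = Path(tempfile.mkdtemp()) / "sample.gz"
with gzip.open(sample_path, "wt", encoding="utf-8") as out:
    out.write('{"id": 1, "name": "Person A"}\n')
    out.write('{"id": 2, "name": "Person B"}\n')

records = list(iter_records(sample_path))
print(len(records), records[0]["name"])
```

Calling `json.load` on the whole file would fail here, because the file is a sequence of JSON objects rather than one JSON document.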
3.2.2 Processing the weird data
In this phase I wrote a script that automatically opens an archive, processes the file, and saves it as a Parquet file with a compression level of 22.
I used Python, Pathlib, Polars, and a lot of patience.
The process took around 20 minutes per file and around three weeks in total (I had to shut down my PC at night). The result was 700 Parquet files, each around 190 MB; 133 GB in total.
3.2.3 Making relational database
The data in the datasets was nested, especially the “experience” field, which held both a person’s experience and the company info. The problem is that the company info gets repeated multiple times across all datasets.
Building a relational database solves this and makes the exploratory data analysis easier.
The code was split in two:
1. I used Polars to split each of the 700 datasets into mini relational databases.
2. I used DuckDB to merge all the mini relational databases and remove duplicates in some tables, mainly company and university information.
The result was a relational database that is 73 GB in size: from 1.4 TB down to 73 GB.
All of this ran on my PC, so no servers were harmed, only my CPU fan and my ears.
3.2.4 Filter
I filtered companies based on their industry, country, and whether I have the email of one of the higher-ups.
4 General graphs
4.1 market research industry’s yearly new recruit count
4.2 cido research’s workforce status over the years
5 Workforce sample
5.1 A Saad Imran
Job title: ****
Associated: True
Socials: https://linkedin.com/in/a-saad-imran-b3559895
5.1.1 A Saad Imran’s working period at cido research
5.1.2 Gantt plot of A Saad Imran’s experience
5.2 Anamika Patnaik
Job title: QC
Associated: True
Socials: https://linkedin.com/in/anamikapatnaik
5.2.1 Anamika Patnaik’s working period at cido research
5.2.2 Gantt plot of Anamika Patnaik’s experience
5.3 Cessandra Latinovich
Job title: Field director
Associated: True
Socials: https://linkedin.com/in/cessandra-latinovich-b5b93b6b
5.3.1 Cessandra Latinovich’s working period at cido research
5.3.2 Gantt plot of Cessandra Latinovich’s experience
5.4 Chris Liu
Job title: Tourism market researcher
Associated: False
Socials: https://linkedin.com/in/chrisliupm
5.4.1 Chris Liu’s working period at cido research
5.4.2 Gantt plot of Chris Liu’s experience
5.5 Claus Günnewig
Job title: Head of back office
Associated: True
Socials: https://linkedin.com/in/claus-günnewig-918a08179
5.5.1 Claus Günnewig’s working period at cido research
5.5.2 Gantt plot of Claus Günnewig’s experience
5.6 Eileen Chen
Job title: Market research analyst
Associated: False
Socials: https://linkedin.com/in/eileenlc | https://linkedin.com/in/eileen-l-chen-30908363 | https://linkedin.com/in/lin-chen-30908363
5.6.1 Eileen Chen’s working period at cido research
5.6.2 Gantt plot of Eileen Chen’s experience
5.7 Erman Akcay
Job title: Market research intern
Associated: False
Socials: https://linkedin.com/in/erman-akcay-394a9447
5.7.1 Erman Akcay’s working period at cido research
5.7.2 Gantt plot of Erman Akcay’s experience
5.8 François-Charles Humbert
Job title: Bilingual customer service representative
Associated: False
Socials: https://linkedin.com/in/françois-charles-humbert-bb12455a
5.8.1 François-Charles Humbert’s working period at cido research
5.8.2 Gantt plot of François-Charles Humbert’s experience
5.9 Gustavo Bolaños
Job title: Quality assurance analyst and quality control analyst
Associated: False
Socials: https://linkedin.com/in/gustavo-loría-bolaños-7301b265
5.9.1 Gustavo Bolaños’s working period at cido research
5.9.2 Gantt plot of Gustavo Bolaños’s experience
5.10 Igor Pawlenko
Job title: Research interviewer
Associated: False
Socials: https://linkedin.com/in/igor-angelo-pawlenko-89a02b50
5.10.1 Igor Pawlenko’s working period at cido research
5.10.2 Gantt plot of Igor Pawlenko’s experience
5.11 Jackee Wong
Job title: Translator and market researcher
Associated: False
Socials: https://linkedin.com/in/jackeewong | https://linkedin.com/in/jackee-wong-b6887878
5.11.1 Jackee Wong’s working period at cido research
5.11.2 Gantt plot of Jackee Wong’s experience
5.12 Mahwash Tasawar
Job title: Associate
Associated: True
Socials: https://linkedin.com/in/mahwash-tasawar-17710965
5.12.1 Mahwash Tasawar’s working period at cido research
5.12.2 Gantt plot of Mahwash Tasawar’s experience
5.13 Natthakan Jeengao
Job title: Phone interviewer
Associated: True
Socials: https://linkedin.com/in/natthakan-jeengao-1416a3153
5.13.1 Natthakan Jeengao’s working period at cido research
5.13.2 Gantt plot of Natthakan Jeengao’s experience
5.14 Oi Chan
Job title: Clerical assistant
Associated: True
Socials: https://linkedin.com/in/oi-hei-chan-6a4458166
5.14.1 Oi Chan’s working period at cido research
5.14.2 Gantt plot of Oi Chan’s experience
5.15 Patrick Kiplagat
Job title: Senior data processing analyst
Associated: True
Socials: https://linkedin.com/in/pkiplagat | https://twitter.com/kiplagat
5.15.1 Patrick Kiplagat’s working period at cido research
5.15.2 Gantt plot of Patrick Kiplagat’s experience
5.16 Rose Baker
Job title: Phone interviewer
Associated: False
Socials: https://linkedin.com/in/rose-baker-ostiguy-2197169b | https://facebook.com/rosebakerlove
5.16.1 Rose Baker’s working period at cido research
5.16.2 Gantt plot of Rose Baker’s experience
5.17 Sam Ho
Job title: English team manager
Associated: False
Socials: https://linkedin.com/in/sam-ho-b73b0169
5.17.1 Sam Ho’s working period at cido research
5.17.2 Gantt plot of Sam Ho’s experience
5.18 Sandy Au-Yeung
Job title: Assistant data collection manager
Associated: True
Socials: https://linkedin.com/in/sandy-au-yeung-197556bb
5.18.1 Sandy Au-Yeung’s working period at cido research
5.18.2 Gantt plot of Sandy Au-Yeung’s experience
5.19 Suki Chan
Job title: Interviewer
Associated: True
Socials: https://linkedin.com/in/suki-chan-28752a95
5.19.1 Suki Chan’s working period at cido research
5.19.2 Gantt plot of Suki Chan’s experience
5.20 Thomas Zurowski
Job title: Telephone interviewer
Associated: True
Socials: https://linkedin.com/in/thomas-zurowski-144090116
5.20.1 Thomas Zurowski’s working period at cido research
5.20.2 Gantt plot of Thomas Zurowski’s experience
5.21 Vindya Seneviratne
Job title: Bilingual interviewer
Associated: False
Socials: https://linkedin.com/in/vindya-seneviratne-2072b423 | https://facebook.com/vindyas
5.21.1 Vindya Seneviratne’s working period at cido research
5.21.2 Gantt plot of Vindya Seneviratne’s experience
5.22 Wah Chiu
Job title: ʹé
Associated: True
Socials: https://linkedin.com/in/wah-chiu-4a7996173
5.22.1 Wah Chiu’s working period at cido research
5.22.2 Gantt plot of Wah Chiu’s experience
5.23 Zore Fernández
Job title: Panel support
Associated: True
Socials: https://linkedin.com/in/zore-fernández-361979a5